Abstract

Background: Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting 
phenotypes ranging from an individual's class membership to his/her risk of developing a disease. In multi-class 
classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the 
sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed 
using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under
the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not 
only guarantee an overall correct classification rate, but also make studies more cost-effective. 

Results: For coded SNP data from $D (\geq 2)$ classes, we derive an optimal Bayes classifier and a linear classifier, and
obtain a normal approximation to the probability of correct classification for each classifier. These approximations are 
then used to evaluate the associated AUCs or VUSs, whose accuracies are validated using Monte Carlo simulations. We 
give a sample size determination method, which ensures that the difference between the two approximate AUCs (or 
VUSs) is below a pre-specified threshold. The performance of our sample size determination method is then illustrated 
via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent 
SNPs and the required total sample sizes are determined for a continuum of threshold values. In all, four different 
sample size determination studies are conducted with the HapMap data, covering cases involving well-separated 
populations to poorly-separated ones. 

Conclusion: For multi-class problems, we have developed a sample size determination methodology and illustrated its
usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this 
methodology will help scientists determine whether a sample at hand is adequate or more samples are required to 
achieve a pre-specified accuracy. A PDF manual for R package "SampleSizeSNP" is given in Additional file 1, and a ZIP 
file of the R package "SampleSizeSNP" is given in Additional file 2. 
Background

Data on single-nucleotide polymorphisms (SNPs) have 
been found to be useful in predicting an individual's class
membership or his/her response to a drug, susceptibil- 
ity to environmental factors such as toxins, and the risk 
of developing a particular disease, among others [1-5]. 
The classification literature provides a variety of classi- 
fiers (e.g., Support Vector Machine, genetic programming, 
Neural Networks and Logistic Regression) and sample 
size determination methods [6-10], but most of these are 
only applicable to continuous data. 

Recently Liu et al. [11] developed an optimal Bayes 
classifier and a linear classifier for coded SNP data from 
two classes, and obtained a normal approximation to 
the probability of correct classification (PCC) for each 
classifier. They also proposed a sample size determina- 
tion methodology to determine an adequate sample size, 
which ensures that the difference between the two approx- 
imate PCCs is below a pre-specified threshold value. 
Using Monte Carlo simulations, Liu et al. [11] assessed
the validity of their approximations. Furthermore, they 
illustrated the performance of their sample size determi- 
nation method via simulations and a real data analysis 
using the HapMap data on two populations — Chinese and 
Japanese. 

While Liu et al. [11] showed that their sample size determination
method is competitive, they also pointed out
that an additional maximization step is required in order 
to determine the discrimination values for each of their 
classifiers; see Remark 1 in their article for more
details. When there are three or more classes, however, 
determination of such discrimination values is not only 
more difficult, but also increases the overall computa- 
tional burden. In a two-class scenario, a well known way to 
overcome this difficulty is to consider the Receiver Oper- 
ating Characteristic (ROC) curve, which plots the True 
Positive Rates vs. False Positive Rates at various discrimination
values [12,13]. Note that the ROC allows the
discrimination value to be varied and it simultaneously 
explores all possible combinations of the correct classifi- 
cation rates [14]. The Area Under the ROC curve (AUC) is 
commonly used as a scalar performance measure, which 
allows classifiers to be compared independent of the dis- 
crimination values. Unfortunately, the AUC measure is 
only applicable to a two-class scenario. A popular exten- 
sion of the AUC measure, known as the Volume Under 
the ROC hyper-Surface (VUS) measure, is often used in 
a multi-class scenario (see e.g., Landgrebe and Duin [14] 
and Landgrebe and Paclik [15]).

This article revisits the problem of sample size deter- 
mination in classification scenarios involving coded SNP 
data, but uses the AUC and the VUS as performance mea- 
sures for two-class and multi-class scenarios, respectively. 
More specifically, for coded SNP data from $D (\geq 2)$
classes, we derive an optimal Bayes classifier and obtain a
normal approximation to its probability of correct classification,
which is denoted by $PCC(\infty)$. We also derive a
linear classifier and obtain a normal approximation to its
probability of correct classification, which is denoted by
$PCC(n)$. For an overall assessment of each of the classifiers,
we define the scalar measures AUC (for two classes)
and VUS (for multiple classes), and correspondingly define
the quantities $AUC(\infty)$, $AUC(n)$, $VUS(\infty)$ and $VUS(n)$
for each classification scenario. For the two-class scenario,
we propose to determine the sample size $n$ for
which $AUC(\infty) - AUC(n) < \gamma$, where $\gamma \in (0,1)$ is
a pre-specified threshold value. For the multi-class
scenario, we propose to determine the sample size
$n$ for which $VUS(\infty) - VUS(n) < \gamma$. A computational
method to determine the total sample size for various
values of $\gamma$ is described. Monte Carlo simulations are
carried out to corroborate our theoretical approximations,
and the performance of our sample size determination 
method is assessed via simulations and analysis of the 
HapMap data consisting of 3 and 4 populations, respec- 
tively. In all, four different sample size determination 
studies are conducted with the HapMap data, cover- 
ing cases involving well-separated populations to poorly- 
separated ones. Details are given in the data analysis 
section. 

R software was used to carry out all the computa- 
tions. A PDF manual for R package "SampleSizeSNP" is 
given in Additional file 1, and a ZIP file of the R package 
"SampleSizeSNP" is given in Additional file 2. 

Methods 

Assumptions 

Suppose there are $D (\geq 2)$ distinct classes denoted by
$C_1, \ldots, C_D$, consisting of $n_1, \ldots, n_D$ subjects, respectively.
For each subject, we observe a $p$-dimensional SNP vector,
$x = (x_1, x_2, \ldots, x_p)'$, where typically $p$ is much larger
($\gg$) than $\sum_{i=1}^{D} n_i$, and the $j$th SNP is coded in such a way that
$x_j = 0, 1, 2$, which denotes the number of minor alleles in
the genotype "aa", "Aa" and "AA", respectively. It is possible
that some of the SNPs are highly correlated, leading us
to choose one SNP to represent a set of highly correlated 
ones. For classification and sample size determination, we 
make the following assumptions: 

1. For an $m$ such that $\sum_{i=1}^{D} n_i \ll m < p$, the data
vector $x = (x_1, \ldots, x_m)'$ consists only of $m$ SNPs,
which are statistically independent. That is, the remaining
$(p - m)$ correlated SNPs are not used for
classification.

2. For each $k = 1, \ldots, D$ and $j = 1, \ldots, m$, we postulate
Hardy-Weinberg equilibrium, according to which
the probability mass function of the coded SNP ($X_j$)
belonging to class $k$ is given by

$$P_k(X_j = x_j \mid \theta_{k,j}) = \binom{2}{x_j} \theta_{k,j}^{x_j} (1 - \theta_{k,j})^{2 - x_j}, \qquad x_j = 0, 1, 2,$$

where $\theta_{k,j}$ is the minor allele frequency at locus $j$ in
class $k$, and by definition $\theta_{k,j} \in (0.01, 0.5)$. Here,
$\theta_{k,j} < 0.5$ because it is the minor allele frequency,
and $\theta_{k,j} > 0.01$ ensures that the polymorphism is not
a mutation. For each $k = 1, \ldots, D$, let
$\theta_k = (\theta_{k,1}, \ldots, \theta_{k,m})'$ denote the parameter vector
corresponding to the class $C_k$.

3. A proportion $\rho$ of the $m$ SNPs has a marginal
effect on any two classes; let $l = \lfloor \rho m \rfloor$ be the
number of SNPs with marginal effects.
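To make Assumption 2 concrete, the following Python sketch evaluates the Hardy-Weinberg probability mass function of a coded SNP and draws coded genotypes from it. The function names `hwe_pmf` and `sample_snp` and the value 0.3 are illustrative assumptions; they are not taken from the SampleSizeSNP package.

```python
import random
from math import comb


def hwe_pmf(x, theta):
    """P(X = x) for a coded SNP under Hardy-Weinberg equilibrium.

    x is the number of minor alleles (0, 1 or 2) and theta is the minor
    allele frequency, so X follows a Binomial(2, theta) distribution.
    """
    if x not in (0, 1, 2):
        raise ValueError("coded SNP must be 0, 1 or 2")
    return comb(2, x) * theta**x * (1.0 - theta) ** (2 - x)


def sample_snp(theta, rng):
    """Draw one coded SNP as the sum of two independent allele draws."""
    return (rng.random() < theta) + (rng.random() < theta)


if __name__ == "__main__":
    theta = 0.3  # illustrative minor allele frequency
    print([hwe_pmf(x, theta) for x in (0, 1, 2)])
    rng = random.Random(0)
    print([sample_snp(theta, rng) for _ in range(10)])
```

Because the SNPs are assumed independent, the joint mass function of a SNP vector is simply the product of these per-locus terms.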

The optimal classifier and its PCC 

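By the assumptions above, the conditional mass function
of $X = (X_1, \ldots, X_l)'$ given the class $C_k$, $k = 1, \ldots, D$, is

$$f_k(x \mid \theta_k) = \prod_{j=1}^{l} \binom{2}{x_j} \theta_{k,j}^{x_j} (1 - \theta_{k,j})^{2 - x_j}.$$

Suppose $\pi_k = P(x \in C_k)$ and we denote the marginal mass
function by $f(x) = \sum_{k=1}^{D} \pi_k f_k(x \mid \theta_k)$; then for each $1 \leq k \leq D$,
the posterior mass function of the class $C_k$ given $x$ is

$$\tau_k(\theta_k \mid x) = \frac{\pi_k f_k(x \mid \theta_k)}{f(x)}.$$

For any fixed $k = 1, \ldots, D$, the Bayes classification rule
then classifies $x$ to the class $C_k$ if

$$\frac{\tau_k(\theta_k \mid x)}{\tau_{k'}(\theta_{k'} \mid x)} > 1 \tag{1}$$

for all $k' \neq k$. This leads to the optimal Bayes classifier,
which classifies $x$ to $C_k$ if

$$\sum_{j=1}^{l} \lambda_j^{k,k'} x_j > K_{k,k'} \tag{2}$$

for all $k' \neq k$, where

$$\lambda_j^{k,k'} = \log\left(\frac{\theta_{k,j}(1 - \theta_{k',j})}{\theta_{k',j}(1 - \theta_{k,j})}\right)
\quad \text{and} \quad
K_{k,k'} = \log\left(\frac{\pi_{k'}}{\pi_k}\right) + 2 \sum_{j=1}^{l} \log\left(\frac{1 - \theta_{k',j}}{1 - \theta_{k,j}}\right).$$

Then, the PCC of the optimal Bayes classifier is defined as

$$PCC(\infty) = \sum_{k=1}^{D} \pi_k P\left(\bigcap_{k' \neq k} \left\{\sum_{j=1}^{l} \lambda_j^{k,k'} X_j > K_{k,k'}\right\} \,\Big|\, X \in C_k\right). \tag{3}$$

In Additional file 3: Appendix 1, we derive a normal
approximation for $PCC(\infty)$, as $l \to \infty$. That is, for large
$l$, we show that

$$PCC(\infty) \approx \sum_{k=1}^{D} \pi_k \int_{K_k}^{\infty} \phi(x; \mu_{l,k}, \Sigma_{l,k})\, dx, \tag{4}$$

where $\phi$ is the $(D - 1)$-dimensional multivariate normal
density, $\int_{K_k}^{\infty}$ is a multiple integral, $K_k$ and $\mu_{l,k}$ are $(D - 1) \times 1$
vectors, and $\Sigma_{l,k}$ is a $(D - 1) \times (D - 1)$ matrix. All these
quantities are defined in Additional file 3: Appendix 1.

In Additional file 3: Appendix 4, we give an expression
for (4) for the case $D = 3$.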
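The Bayes rule in (1) amounts to assigning a SNP vector to the class with the largest posterior probability. A minimal Python sketch of that rule follows, assuming the allele frequencies and prior probabilities are known; the names `log_posterior_scores` and `bayes_classify` are hypothetical, and the toy frequencies are made up for illustration only.

```python
from math import comb, log


def log_posterior_scores(x, thetas, priors):
    """Return log(pi_k * f_k(x | theta_k)) for each class k.

    The common factor 1 / f(x) is omitted because it does not change
    which class has the largest posterior probability.
    """
    scores = []
    for theta_k, pi_k in zip(thetas, priors):
        s = log(pi_k)
        for xj, t in zip(x, theta_k):
            s += log(comb(2, xj)) + xj * log(t) + (2 - xj) * log(1.0 - t)
        scores.append(s)
    return scores


def bayes_classify(x, thetas, priors):
    """Index of the class with the largest posterior probability."""
    scores = log_posterior_scores(x, thetas, priors)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy example: three classes, four independent SNPs (made-up values).
    thetas = [
        [0.45, 0.45, 0.45, 0.45],
        [0.30, 0.30, 0.30, 0.30],
        [0.15, 0.15, 0.15, 0.15],
    ]
    priors = [1 / 3, 1 / 3, 1 / 3]
    print(bayes_classify([2, 2, 2, 2], thetas, priors))
    print(bayes_classify([0, 0, 0, 0], thetas, priors))
```

Working on the log scale avoids numerical underflow when the number of SNPs is large, and comparing two class scores is equivalent to the pairwise linear inequality in the coded SNPs.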

A linear classifier and its PCC 

Motivated by the form of the optimal Bayes classifier in
(2), we consider the following linear classifier that classifies
$x$ to the class $C_k$ if

$$\sum_{j=1}^{m} w_{j,n}(k,k')\, \hat{\lambda}_j^{k,k'} x_j > K_{k,k'} \tag{5}$$

for all $k' \neq k$, where
$\hat{\lambda}_j^{k,k'} = \log\left(\frac{\hat{\theta}_{k,j}(1 - \hat{\theta}_{k',j})}{\hat{\theta}_{k',j}(1 - \hat{\theta}_{k,j})}\right)$,
and $\hat{\theta}_{k,j}$ and $\hat{\theta}_{k',j}$ are
the maximum likelihood estimators of $\theta_{k,j}$ and $\theta_{k',j}$, respectively.
Also, the values of the weights $w_{j,n}(k,k')$ in (5) are
determined in the following way: For each $j = 1, \ldots, m$
and $k' \neq k$, suppose we test the hypothesis $H_0^{(j)}: \theta_{k,j} = \theta_{k',j}$
versus $H_1^{(j)}: \theta_{k,j} \neq \theta_{k',j}$. Then $w_{j,n}(k,k') = 1$ if
$H_0^{(j)}$ is rejected; else $w_{j,n}(k,k') = 0$. In Additional file
3: Appendix 2, we use large-sample theory to derive
a Wald test of level $\alpha$ to test $H_0^{(j)}$ versus $H_1^{(j)}$, and an
expression for the power, $1 - \beta(n_k, n_{k'}, h_j)$, of this test,
when $\theta_{k,j} - \theta_{k',j} = h_j$.

In Additional file 3: Appendix 3, we derive a normal
approximation for the PCC of the linear classifier, denoted
by $PCC(n)$. That is, for large $l$, we show that

$$PCC(n) \approx \sum_{k=1}^{D} \pi_k \int_{K_k}^{\infty} \phi(x; \mu_{l,n,k}, \Sigma_{l,n,k})\, dx. \tag{6}$$

Note that $PCC(n)$ depends on $n = (n_1, \ldots, n_D)'$ through
$(\mu_{l,n,k}, \Sigma_{l,n,k})$; see Additional file 3: Appendix 3 for details.
In Additional file 3: Appendix 4, we give an expression for
(6) for the case $D = 3$.
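The weights in (5) come from a per-SNP test of equal allele frequencies in two classes. The exact Wald test is derived in Additional file 3, which is not reproduced here, so the Python sketch below assumes the standard two-sided Wald statistic for a difference of two binomial proportions, with each allele frequency estimated as the total minor-allele count divided by twice the number of subjects. The names `maf_mle` and `wald_weight` and the toy genotypes are illustrative assumptions.

```python
from statistics import NormalDist


def maf_mle(genotypes):
    """MLE of the minor allele frequency from coded genotypes (0, 1, 2)."""
    return sum(genotypes) / (2.0 * len(genotypes))


def wald_weight(geno_k, geno_kp, alpha=0.01):
    """Return 1 if H0: theta_k = theta_k' is rejected at level alpha, else 0.

    Assumed form of the test: a two-sided Wald statistic using the
    binomial variance theta * (1 - theta) / (2 n) of each estimate.
    """
    t_k, t_kp = maf_mle(geno_k), maf_mle(geno_kp)
    var = (t_k * (1 - t_k) / (2 * len(geno_k))
           + t_kp * (1 - t_kp) / (2 * len(geno_kp)))
    if var == 0.0:
        # Both samples are monomorphic; reject only if the estimates differ.
        return int(t_k != t_kp)
    z = (t_k - t_kp) / var ** 0.5
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return int(abs(z) > critical)


if __name__ == "__main__":
    class_k = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 5    # 50 subjects, made up
    class_kp = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] * 5   # 50 subjects, made up
    print(maf_mle(class_k), maf_mle(class_kp))
    print(wald_weight(class_k, class_kp))
    print(wald_weight(class_k, class_k))
```

A SNP with weight 0 for a given pair of classes drops out of the corresponding sum in (5), so only SNPs that show a detectable difference contribute to the linear classifier.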

AUC and VUS for the optimal and linear classifiers 

For any $(k, k')$, define

$$\xi_{k,k'} = P(\text{Classify } X \text{ to } C_{k'} \mid X \in C_k).$$

Then, for the optimal Bayes classifier in (2) we have from
(4) that

$$\xi_{k,k} = \int_{K_k}^{\infty} \phi(x; \mu_{l,k}, \Sigma_{l,k})\, dx, \tag{7}$$

and similarly, for the linear classifier in (5), we have from
(6) that

$$\hat{\xi}_{k,k} = \int_{K_k}^{\infty} \phi(x; \mu_{l,n,k}, \Sigma_{l,n,k})\, dx, \tag{8}$$

for $k = 1, \ldots, D$. When $D = 2$, for the optimal
Bayes classifier, the $ROC(\infty)$ for two classes is the curve
$\xi_{2,2}$ vs. $(1 - \xi_{1,1})$. Then, the $AUC(\infty)$ is

$$AUC(\infty) = \int \xi_{2,2}\, d\xi_{1,1}.$$

However, when the number of classes $D \geq 3$, we need
to consider the volume under the ROC hypersurface. Following
the work of Landgrebe and Duin [14], the VUS is
defined as

$$VUS(\infty) = \int \cdots \int \xi_{D,D}\, d\xi_{1,1}\, d\xi_{2,2} \cdots d\xi_{(D-1),(D-1)} \tag{9}$$

$$= \int \cdots \int \xi_{D,D} \left| \frac{\partial(\xi_{1,1}, \xi_{2,2}, \ldots, \xi_{(D-1),(D-1)})}{\partial(K_1, K_2, \ldots, K_{D-1})} \right| dK_1 \cdots dK_{D-1}.$$

By replacing $\xi_{k,k}$ by $\hat{\xi}_{k,k}$ [see (8)] in the above definitions
of ROC, AUC and the VUS, we obtain corresponding
expressions for the linear classifier in (5). We denote the
resulting ones as $AUC(n)$ and $VUS(n)$. In Additional file
3: Appendix 4, we derive these expressions for the case
$D = 3$.
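For two classes, the AUC integral can be checked numerically by sweeping the discrimination value. The Python sketch below is a simplified univariate illustration, not the multivariate expressions of Additional file 3: it assumes the classifier score is normally distributed in each class, with made-up means and standard deviations, and compares a trapezoidal approximation of the integral of the class-2 correct rate against the class-1 correct rate with the closed form available for normal scores. The function names are hypothetical.

```python
from statistics import NormalDist


def auc_by_threshold_sweep(mu1, sd1, mu2, sd2, steps=2000):
    """Trapezoidal approximation to the integral of xi_22 d(xi_11).

    Assumes the classifier score is normal in each class and that a
    subject is assigned to class 1 when its score exceeds K.
    """
    d1, d2 = NormalDist(mu1, sd1), NormalDist(mu2, sd2)
    lo = min(mu1 - 8 * sd1, mu2 - 8 * sd2)
    hi = max(mu1 + 8 * sd1, mu2 + 8 * sd2)
    # Sweep K from high to low so that xi_11 increases from 0 to 1.
    ks = [hi - (hi - lo) * i / steps for i in range(steps + 1)]
    xi11 = [1.0 - d1.cdf(k) for k in ks]   # correct rate for class 1
    xi22 = [d2.cdf(k) for k in ks]         # correct rate for class 2
    area = 0.0
    for i in range(steps):
        area += 0.5 * (xi22[i] + xi22[i + 1]) * (xi11[i + 1] - xi11[i])
    return area


def auc_closed_form(mu1, sd1, mu2, sd2):
    """P(score from class 1 > score from class 2) for normal scores."""
    return NormalDist().cdf((mu1 - mu2) / (sd1**2 + sd2**2) ** 0.5)


if __name__ == "__main__":
    print(auc_by_threshold_sweep(1.0, 1.0, 0.0, 1.0))
    print(auc_closed_form(1.0, 1.0, 0.0, 1.0))
```

The sweep never fixes a single discrimination value, which is the reason the AUC (and, for more classes, the VUS) can compare classifiers independently of that choice.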

Computation of VUS 

As is evident from (9), the computation of VUS involves
high-dimensional integration. Given below is a brief
description of the steps involved in the computation of
VUS. For ease of exposition, we will denote $\xi_k = \xi_{k,k}$, $k = 1, \ldots, D$.
First, we randomly generate the thresholds
$K = (K_1, K_2, \ldots, K_{D-1})$ (see (9)) and compute the corresponding
$\xi = (\xi_1, \ldots, \xi_D)$ satisfying (7). Note that $\xi$
contributes to the integration in VUS only if all the $\xi_k$'s are
positive.

To find as many $\xi$ values that contribute to the integration
as possible, we use the ant colony optimization
algorithm, where only the $K$ values corresponding to the
$\xi$ values that contribute to the integration are retained.
These retained values are perturbed by a small noise and the
resulting $K$ values are used as seeds for the next iteration.
Then, we use the genetic algorithm to obtain another
$\xi$ value located in a different region within $(0, 1)^D$, which
also contributes to the integration. We use the ant colony
algorithm and the genetic algorithm alternately to eventually
generate a dense set of $\xi$ $(\in (0, 1)^D)$ values that
contribute to the integration. Note that the newly generated
$\xi$ values are appended to all the
previously generated $\xi$ values.

Now, to compute the volume, $VUS(\infty)$, we use the
convhulln function in the qhull R package. Note that the
convhulln function is designed to determine the convex
hull of a set of $D$-dimensional points and thus compute
the volume of the hull. In view of this, in order to compute
the volume, $VUS(\infty)$, a base of $\xi$ (this is the same as the
$\xi$ vector, except that one of its components, e.g. the first
component, is set to 0) is appended to the original $\xi$. Since
in each iteration the new $\xi$ values are appended to the old
$\xi$ values from the previous iterations, and the VUS is concave,
the computed VUS is expected to increase in value
with each iteration. We stop appending new $\xi$ values
when $|VUS_{old} - VUS_{new}| < 0.001$. When this criterion
is satisfied, we obtain the value of $VUS(\infty)$. The
values of $AUC(\infty)$, $AUC(n)$, and $VUS(n)$ are calculated similarly.
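The hull-based computation can be illustrated in two dimensions, where the volume reduces to a polygon area. The Python sketch below is a simplified analogue under stated assumptions, not the qhull-based R code: it samples points on a made-up concave curve (a quarter circle standing in for an ROC-type curve), appends for each point a base point whose second component is set to 0, and keeps adding batches until the hull area changes by less than 0.001. Points are drawn uniformly at random rather than by the ant colony and genetic algorithms, and all function names are hypothetical.

```python
import math
import random


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def hull_area_until_stable(curve, rng, batch=50, tol=1e-3, max_iter=100):
    """Append batches of curve points plus their base until the area settles."""
    # Start from the two extreme points of the curve and their base points.
    points = [(0.0, curve(0.0)), (0.0, 0.0), (1.0, curve(1.0)), (1.0, 0.0)]
    old = 0.0
    for _ in range(max_iter):
        for _ in range(batch):
            u = rng.random()
            points.append((u, curve(u)))   # point on the curve
            points.append((u, 0.0))        # its base: second component set to 0
        new = polygon_area(convex_hull(points))
        if abs(new - old) < tol:
            return new
        old = new
    return old


if __name__ == "__main__":
    quarter_circle = lambda u: math.sqrt(1.0 - u * u)
    print(hull_area_until_stable(quarter_circle, random.Random(0)))
    print(math.pi / 4)  # exact area under the quarter circle
```

Because the curve is concave, every sampled point lies on the hull and the computed area can only grow as points are appended, which is what makes the change in area a usable stopping rule.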

Sample size determination using VUS or AUC

Given a threshold $\gamma$, we determine the sample size $n$
satisfying the following condition:

$$VUS(\infty) - VUS(n) < \gamma. \tag{10}$$

For the case $D = 2$, we determine the sample size $n$ satisfying
the condition: $AUC(\infty) - AUC(n) < \gamma$. A simulation
study for the case $D = 2$ is carried out in Additional file 3:
Appendix 5 to assess the performance of our sample size
determination algorithm.

Results 

Monte Carlo simulations 

Before we illustrate the performance of our sample size
determination method based on AUC or VUS, we present
results from an extensive Monte Carlo simulation study
conducted to verify the accuracy of the approximations for
$AUC(n)$ and $VUS(n)$, respectively, and study their behavior
as a function of $n$ and other parameters. Here, we
present the numerical assessments based on the VUS for
the cases $D = 3$ and 4, respectively. As mentioned
above, the assessments based on the AUC for the
case $D = 2$ are given in Additional file 3: Appendix 5.
Henceforth, we will set $n_k = n$ for all $k = 1, \ldots, D$, and we
will use the scalar $n$ instead of the vector $n = (n_1, \ldots, n_D)'$ to simplify notation.

When $D = 3$, we consider the following simulation set-up:
For $\theta_1 = (\theta_{1,1}, \ldots, \theta_{1,m})'$, let $\theta_{1,j} \sim U(0.4, 0.49)$,
$j = 1, \ldots, m$; for a specified scalar value $h$, let $h_1, h_2$ be such
that their components $h_{i,j} \sim U(h - 0.002, h + 0.002)$,
$i = 1, 2$; $j = 1, \ldots, m$; and let $\theta_2 = \theta_1 - h_1$, $\theta_3 = \theta_2 - h_2$.
First, we generated a $(\theta_1, \theta_2, \theta_3)$ according to the above set-up,
and then generated the data vector $x = (x_1, \ldots, x_m)'$
for each class. We then computed $VUS(\infty)$ and $VUS(n)$
following the computational methodology described earlier.
For this $(\theta_1, \theta_2, \theta_3)$, we then drew twenty $x$ data
sets and calculated a Monte Carlo estimate, denoted by
$VUS(n)_{MC}$. This process was repeated 20 times and an
average value of $VUS(n)_{MC}$ was computed. These are
given in Table 1. It is evident from Table 1 that the Bias $=
VUS(n)_{MC} - VUS(n)$ is negligible in most cases, which
validates the use of our approximation for $VUS(n)$. Table 1
also gives similar results for the case $D = 4$. Note that
$VUS(\infty) = 1/D!$ for a random classifier, which is the
lower bound of $VUS(\infty)$ for any classifier.
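The parameter-generation step of this simulation set-up can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the random seed is arbitrary, and the sketch stops at generating allele frequencies and coded genotypes (it does not compute any VUS value).

```python
import random


def simulate_allele_frequencies(num_classes, m, h, rng):
    """Generate theta_1, ..., theta_D following the simulation set-up.

    theta_1[j] ~ U(0.4, 0.49) and theta_{i+1} = theta_i - h_i, where each
    component of h_i is drawn from U(h - 0.002, h + 0.002).
    """
    thetas = [[rng.uniform(0.4, 0.49) for _ in range(m)]]
    for _ in range(num_classes - 1):
        shift = [rng.uniform(h - 0.002, h + 0.002) for _ in range(m)]
        thetas.append([t - s for t, s in zip(thetas[-1], shift)])
    return thetas


def simulate_genotypes(theta, n, rng):
    """Draw n coded SNP vectors; each entry is Binomial(2, theta_j)."""
    return [
        [(rng.random() < t) + (rng.random() < t) for t in theta]
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(2024)
    thetas = simulate_allele_frequencies(num_classes=3, m=50, h=0.05, rng=rng)
    data = [simulate_genotypes(theta, n=100, rng=rng) for theta in thetas]
    print(len(data), len(data[0]), len(data[0][0]))
```

With $h = 0.1$ and four classes, the smallest generated frequency is roughly $0.4 - 3 \times 0.102$, which stays above the lower limit of 0.01 required by the assumptions.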
Table 1 Performance of optimal and linear classifiers

D = 3
h      m     n     VUS(∞)        VUS(n)   VUS(n)MC   Bias
0.02   50    50    0.3013        0.1772   0.1657     -0.0116
0.02   50    100   0.3015        0.1793   0.1742     -0.0052
0.02   100   50    0.3662        0.1807   0.1874      0.0067
0.02   100   100   0.366         0.1837   0.1974      0.0136
0.05   50    50    0.5469        0.2229   0.2442      0.0213
0.05   50    100   0.5467        0.2517   0.2845      0.0328
0.05   100   50    0.6988        0.2448   0.2912      0.0463
0.05   100   100   0.6987        0.2848   0.3377      0.0529
0.1    50    50    (illegible)   0.4179   0.4675      0.0496
0.1    50    100   0.8687        0.4958   0.55        0.0542
0.1    100   50    0.9667        0.4776   0.5342      0.0566
0.1    100   100   0.9667        0.5692   0.6341      0.0649

D = 4
h      m     n     VUS(∞)        VUS(n)   VUS(n)MC   Bias
0.02   50    50    0.1319        0.048    0.0462     -0.0018
0.02   50    100   0.1318        0.05     0.0512      0.0013
0.02   100   50    0.1892        0.0503   0.057       0.0068
0.02   100   100   0.189         0.0531   0.0614      0.0082
0.05   50    50    0.3891        0.0893   0.0923      0.003
0.05   50    100   0.3893        0.1175   0.1144     -0.0032
0.05   100   50    0.5832        0.1092   0.1127      0.0034
0.05   100   100   0.5831        0.1458   0.1285     -0.0174
0.1    50    50    0.8376        0.2933   0.2705     -0.0228
0.1    50    100   0.8378        0.4059   0.3517     -0.0542
0.1    100   50    0.9623        0.3653   0.3119     -0.0534
0.1    100   100   0.9626        0.4962   0.4085     -0.0877

Here, $D = 3$ and 4; $\theta_1 = (\theta_{1,1}, \ldots, \theta_{1,m})'$ with $\theta_{1,j} \sim U(0.4, 0.49)$, $j = 1, \ldots, m$; for a
specified scalar value $h$, $h_1, h_2, h_3$ are such that their components $h_{i,j} \sim U(h - 0.002, h + 0.002)$,
$j = 1, \ldots, m$; and $\theta_{i+1} = \theta_i - h_i$, $i = 1, 2, 3$; $n$ is the
sample size for each class; $m$ is the number of independent SNPs; $\alpha = 0.01$ is the
significance level for the Wald tests; and $\rho = 1$ is the proportion of significant
SNPs.



Next, we determine the smallest $n$ such that $f(n) =
VUS(\infty) - VUS(n) - \gamma < 0$, for a pre-specified $\gamma$ value.
We use the following bisection algorithm to determine such an $n$: (i)
Choose a small $n_S$ and a large $n_L$ such that $f(n_S) > 0$ and $f(n_L) < 0$, and
set $n_M = [(n_S + n_L)/2]$; (ii) If $f(n_M) f(n_S) < 0$, then reset
$n_L = n_M$; or else, reset $n_S = n_M$. In either case, return
to step (i), unless $n_L - n_S \leq 1$, in which case the smallest
sample size is $n = n_L$; (iii) Use the smallest (total) sample
of size $D \times n_L$ with $n = n_L$ from each class, $C_1, \ldots, C_D$.
We implemented this algorithm for each value of $h$, $m$
and significance level $\alpha$ for the Wald test; see the discussion
below (5). For the cases $D = 3$ and $D = 4$, respectively,
Table 2 displays the determined sample sizes for $\gamma = 0.01$
and each combination of parameter values. From Table 2,
it is evident that the required sample size decreases as $h$
increases, as expected. Hence, $f(n) < 0$ for smaller sample
sizes, as shown in Table 2. However, the effect of $m$ on
the determined sample sizes is less clear. When $h$ is large,
say $h \geq 0.1$, the required sample size decreases as $m$
becomes large, whereas when $h$ is small, say $h = 0.05$,
the reverse is true as $m$ becomes large.
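The search for the smallest per-class sample size is a bisection on the integer $n$. A minimal Python sketch is given below; it assumes the gap $VUS(\infty) - VUS(n)$ is supplied as a function of $n$ and is non-increasing in $n$. The function name `smallest_sample_size`, its default search limits, and the toy learning curve are hypothetical and are not part of the SampleSizeSNP package.

```python
def smallest_sample_size(gap, gamma, n_small=2, n_large=100_000):
    """Smallest n in [n_small, n_large] with gap(n) - gamma < 0.

    `gap(n)` stands in for VUS(inf) - VUS(n) and is assumed to be
    non-increasing in n, which is what makes bisection valid.
    """
    def f(n):
        return gap(n) - gamma

    if f(n_small) < 0:
        return n_small
    if f(n_large) >= 0:
        raise ValueError("n_large is too small to meet the threshold")
    while n_large - n_small > 1:
        n_mid = (n_small + n_large) // 2
        if f(n_mid) < 0:
            n_large = n_mid      # criterion met: tighten the upper end
        else:
            n_small = n_mid      # criterion not met: raise the lower end
    return n_large


if __name__ == "__main__":
    # Toy learning curve (made up): the gap shrinks like 1 / (n + 0.5).
    toy_gap = lambda n: 1.0 / (n + 0.5)
    n_per_class = smallest_sample_size(toy_gap, gamma=0.01)
    num_classes = 3
    print(n_per_class, num_classes * n_per_class)
```

Each iteration halves the search interval, so even a wide initial bracket needs only a few dozen evaluations of the gap, which matters because every evaluation involves a VUS computation.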

Application to the HapMap data 

The aim of the International HapMap Project is to develop 
a haplotype map of the human genome, which will 
describe the common patterns of human DNA sequence 
variation. 

The HapMap data (Phase III) consist of eleven
populations with about $p = 1.2 \times 10^6$ SNPs. Here,
we consider the following nine populations in order
to illustrate our sample size determination algorithm:
ASW (African ancestry in Southwest USA, 87 subjects);
CEU (Utah residents with Northern and Western
European ancestry from the CEPH collection, 167 subjects);
CHB (Han Chinese individuals from Beijing, 137 subjects);
CHD (Chinese in Metropolitan Denver, Colorado, 109 subjects);
GIH (Gujarati Indians in Houston, Texas, 101 subjects);
JPT (Japanese individuals from Tokyo, 113 subjects);
MEX (Mexican ancestry in Los Angeles, California, 86 subjects);
TSI (Toscani in Italy, 102 subjects);
and YRI (Yoruba in Ibadan, Nigeria, 203 subjects).
With these, we created four sample size determination 
studies, of which the first three involve three populations 
(D = 3), and the last study involves four populations (D = 
4). More specifically, we conducted our sample size deter- 
mination studies with the following population groupings: 



Table 2 Sample size determination: here, $D = 3$ and 4, and
$n$ is the sample size for each class satisfying
$VUS(\infty) - VUS(n) < \gamma\ (= 0.01)$

                 n
D    h      m = 30   m = 50   m = 100   m = 200
3    0.05   1957     2040     2091      2040
3    0.1    489      475      412       288
3    0.15   189      161      105       69
4    0.05   1923     2051     2137      2122
4    0.1    490      476      417       297

Here, $\theta_1 = (\theta_{1,1}, \ldots, \theta_{1,m})'$ with $\theta_{1,j} \sim U(0.4, 0.49)$, $j = 1, \ldots, m$; for a specified scalar
value $h$, $h_1, h_2, h_3$ are such that their components $h_{i,j} \sim U(h - 0.002, h + 0.002)$,
$j = 1, \ldots, m$; and $\theta_{i+1} = \theta_i - h_i$, $i = 1, 2, 3$; $m$ is the number of independent
SNPs; $\alpha = 0.01$ is the significance level for the Wald tests; and $\rho = 1$ is the proportion
of significant SNPs.
Figure 2 Total sample sizes needed for classification to
moderately-separated HapMap populations ASW, TSI, and YRI.
For the linear classifier based on the SNP data from the three
populations, the estimated learning curve gives the required total
sample size for different values of the threshold, $\gamma$, satisfying
$VUS(\infty) - VUS(n) < \gamma$. Here, $\rho = 1$, $\alpha = 0.1$, $m = 92$, and
$VUS(\infty) = 0.7557$.



(I) (CEU, GIH, MEX); (II) (ASW, TSI, YRI); (III) (CHB, 
JPT, CHD); and (IV) (CHB, JPT, CHD, GIH). 

Based on all the available subjects, we extracted pair- 
wise independent SNPs using the following steps. Suppose 
L is a set of SNPs, then: (I) form a set S with one SNP from 
L and update S after the next step; (II) from the remaining 
SNPs in L, choose one SNP that is independent of every 
SNP in S using Kendall's $\tau$ coefficient as a test statistic to
test pair-wise independence, and then add this new SNP
to S. Here, we concluded independence if Kendall's
$\tau$ value was below 0.05; (III) Repeat (II) until each remaining SNP
in L is correlated with at least one SNP in S. This proce- 
dure yielded a set S with m = 92 pair-wise independent 
SNPs, and with these we built our linear classifier. 
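The greedy extraction of pair-wise independent SNPs described above can be sketched as follows. This Python illustration makes several assumptions that are not stated in the text: it uses the tie-corrected tau-b variant of Kendall's coefficient, it compares the absolute value of tau with the 0.05 cut-off, and it visits the SNPs in the order in which they are supplied. The function names and the toy genotype columns are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair count)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: ignored by tau-b
            if dx == 0:
                ties_x += 1           # tied in x only
            elif dy == 0:
                ties_y += 1           # tied in y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = ((concordant + discordant + ties_x)
             * (concordant + discordant + ties_y)) ** 0.5
    return 0.0 if denom == 0 else (concordant - discordant) / denom


def select_independent_snps(columns, threshold=0.05):
    """Greedy selection: keep a SNP only if |tau| with every kept SNP is small.

    `columns` maps a SNP name to its list of coded genotypes. Using the
    absolute value of tau is an assumption of this sketch.
    """
    selected = []
    for name, values in columns.items():
        if all(abs(kendall_tau(values, columns[s])) < threshold
               for s in selected):
            selected.append(name)
    return selected


if __name__ == "__main__":
    snps = {
        "snp_a": [0, 0, 0, 1, 1, 1, 2, 2, 2],
        "snp_b": [0, 0, 0, 1, 1, 1, 2, 2, 2],   # duplicate of snp_a
        "snp_c": [0, 1, 2, 0, 1, 2, 0, 1, 2],   # balanced against snp_a
    }
    print(select_independent_snps(snps))
```

A single pass suffices because the selected set only grows: a SNP rejected at some point remains correlated with at least one selected SNP, which is exactly the stopping condition in step (III).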

Next, we set $\rho = 1$ so that $m = l = 92$; see
Assumption 3 under the Methods section. Recall that
$\theta_k = (\theta_{k,1}, \ldots, \theta_{k,l})'$ for $k = 1, \ldots, D$. For the cases $D = 3$ and
$D = 4$ considered in studies (I) to (IV) above, we estimated
$\theta_k$ using the maximum likelihood (ML) estimates
obtained based on all the available subjects belonging to
the respective populations. We then substituted these ML
estimates into the corresponding expressions for $VUS(\infty)$
and $VUS(n)$, respectively. Figures 1, 2 and 3 show plots of
required sample sizes for a continuum of threshold values
$\gamma$ for the case $D = 3$ considered in studies (I) to (III),
respectively, and Figure 4 plots the same for $D = 4$ considered
in study (IV). From these figures, the required total
sample size can be determined approximately for each
pre-specified $\gamma$ value.

For example, if we set $\gamma = 0.10$ (i.e., $VUS(\infty) -
VUS(n) < 0.10$), then in the three-population (CEU, GIH,
MEX) case, $VUS(\infty) = 0.9046$ and about 62 observations
are required for each class, with a total sample
size of 186, whereas in the three-population (ASW, TSI,
YRI) case, $VUS(\infty) = 0.7557$ and about 150 observations
are required for each class, with a total sample size
of 450. Note that, for $\gamma = 0.10$, in study (I) the required
sample size for each population is less than what is currently
available, whereas in study (II), we would need 63
and 48 more observations for the populations ASW and
TSI, respectively. For the three-population (CHB, JPT and
CHD) case, if we set $\gamma = 0.10$ then $VUS(\infty) =
0.6178$ and about 244 observations are required for each
Figure 1 Total sample sizes needed for classification to
well-separated HapMap populations CEU, GIH, and MEX. For the
linear classifier based on the SNP data from the three populations, the
estimated learning curve gives the required total sample size for
different values of the threshold, $\gamma$, satisfying $VUS(\infty) - VUS(n) < \gamma$.
Here, $\rho = 1$, $\alpha = 0.1$, $m = 92$, and $VUS(\infty) = 0.9046$.
Figure 3 Total sample sizes needed for classification to
poorly-separated HapMap populations CHB, JPT, and CHD. For
the linear classifier based on the SNP data from the three populations,
the estimated learning curve gives the required total sample size for
different values of the threshold, $\gamma$, satisfying $VUS(\infty) - VUS(n) < \gamma$.
Here, $\rho = 1$, $\alpha = 0.1$, $m = 92$, and $VUS(\infty) = 0.6178$.
Figure 4 Total sample sizes needed for classification to mostly
poorly-separated HapMap populations CHB, JPT, CHD and GIH.
For the linear classifier based on the SNP data from the four
populations, the estimated learning curve gives the required total
sample size for different values of the threshold, $\gamma$, satisfying
$VUS(\infty) - VUS(n) < \gamma$. Here, $\rho = 1$, $\alpha = 0.1$, $m = 92$, and
$VUS(\infty) = 0.5580$.


class, with a total sample size of 732. Clearly, for study
(III) at least 100 more observations are needed for each
population (CHB, JPT and CHD) when $\gamma = 0.10$. Finally,
for the four-population (CHB, JPT, CHD, GIH) case, setting
$\gamma = 0.10$ yields $VUS(\infty) = 0.5580$, and about
279 samples are required for each class, with a total sample
size of 1,116. Once again, at least 140 more observations are
needed for each of the four populations when $\gamma = 0.10$.
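The comparison between required and available sample sizes is simple arithmetic on the subject counts listed earlier for the nine HapMap populations. The short Python sketch below reproduces it; the dictionary `AVAILABLE` copies those counts, while the function name `shortfall` is a hypothetical helper introduced only for illustration.

```python
# Subject counts for the nine HapMap populations, as listed in the text.
AVAILABLE = {
    "ASW": 87, "CEU": 167, "CHB": 137, "CHD": 109, "GIH": 101,
    "JPT": 113, "MEX": 86, "TSI": 102, "YRI": 203,
}


def shortfall(populations, required_per_class):
    """Additional subjects needed per population (0 if enough are available)."""
    return {
        pop: max(0, required_per_class - AVAILABLE[pop])
        for pop in populations
    }


if __name__ == "__main__":
    # Per-class requirements at threshold 0.10, as read from the learning curves.
    studies = {
        "I": (("CEU", "GIH", "MEX"), 62),
        "II": (("ASW", "TSI", "YRI"), 150),
        "III": (("CHB", "JPT", "CHD"), 244),
        "IV": (("CHB", "JPT", "CHD", "GIH"), 279),
    }
    for label, (pops, required) in studies.items():
        print(label, required * len(pops), shortfall(pops, required))
```

The total sample size for a study is the per-class requirement multiplied by the number of classes, since the method assumes equal sample sizes across classes.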

The results from the four HapMap studies suggest that
the $VUS(\infty)$ value is large and the required total sample
size is small when the populations are well-separated
[as in study (I)]. When the populations are
moderately-separated [as in study (II), where the populations
ASW and YRI may be similar], the $VUS(\infty)$ value
decreases and the required total sample size increases moderately.
When the populations are poorly-separated [as in
study (III), where all three populations may be similar],
the $VUS(\infty)$ value decreases even further and there
is a substantial increase in the required total sample size.
Finally, in the four-population study, where three of the
populations are poorly-separated, once again we see a further
reduction in the $VUS(\infty)$ value and a corresponding
increase in the required total sample size. Although not
reported here, we also considered other well-, moderately-
and poorly-separated cases with the HapMap data and
observed results similar to the ones reported here.

It is well known in the classification literature that the 
performance of a classifier depends on how well sepa- 
rated the classes are. Similarly, the studies above involv- 
ing the HapMap data show that the performance of our 
sample size determination methodology also depends on 
the extent of separation between populations. While our
methodology provides a formal way of determining an
approximate total sample size for each specified value of
$\gamma$, it is clear from the HapMap data analysis that it is
not possible to propose a universal $\gamma$ value. Nevertheless,
if the classes are well-separated or moderately-separated,
then we believe that $\gamma = 0.10$ may be a good choice for
many frequently encountered data sets in classification
problems.

Discussion 

We have built an optimal Bayes classifier and a linear
classifier based on coded SNP data from two or more
classes. For these classifiers, we have considered the two
commonly used scalar performance measures, the Area
Under the ROC curve (AUC) and the Volume Under the
ROC hyper-Surface (VUS), which allow classifiers to be
compared independent of discrimination values. We have
illustrated the performance of a sample size determination
methodology, which selects the smallest total sample
size $n$ such that the criterion $VUS(\infty) - VUS(n) < \gamma$
is satisfied. While the approximations to the VUS (or
AUC) obtained here provide the necessary theoretical justification,
the simulations and the HapMap data analysis
presented here illustrate the practical value of our sample
size determination method.

The fact that the HapMap contains data on multiple
populations belonging to similar or dissimilar geographical
locations enabled us to test the performance
of our sample size determination method on three
different multi-class scenarios involving well-separated,
moderately-separated, and poorly-separated populations.
We have shown that the extent of separation between
the populations and the choice of threshold value affect
the total sample size required to satisfy the criterion. With
regard to the choice of the threshold value $\gamma$ in other
practical contexts, we recommend that the user take into
consideration the cost of obtaining more samples and
choose an appropriate value of $\gamma$ that gives an acceptable
precision. In other words, if the cost of sampling is affordable
then the user may want to sample more to achieve a
higher precision (lower $\gamma$ value) using our classifier;
otherwise, the user has to settle for a higher $\gamma$ value that
makes use of all the available samples. We also infer from
our HapMap data analysis that a value of $VUS(\infty) > 0.80$
may indicate that the classes are well separated.
Thus, the value of $VUS(\infty)$ could also give some prior
guidance on the choice of $\gamma$ values, especially in instances
where the cost of sampling is a serious concern.

Conclusion 

In summary, for multiple classes, we have developed an 
asymptotic methodology based on AUC or VUS to esti- 
mate the learning curve of SNP classifiers. It is shown 
that the required total sample size can be obtained 
from the estimated learning curve for each pre-specified
threshold value. In classification problems, sample size 
determination is important due to cost considerations. 
This methodology will help scientists determine if a sam- 
ple at hand is adequate or more observations are neces- 
sary to achieve a pre-specified accuracy, and thus help 
users strike an optimal balance between precision and 
cost. 